OpenStreetMap Data Wrangling with SQL

By Sheng BI

Project Overview

  1. Select an area and download the data from OpenStreetMap.

  2. Audit and clean the data, then investigate it using SQL and Pandas.

Map Area

West Hollywood, CA, United States

This is a place where many celebrities live, and I am interested in how people live there. The data source is OpenStreetMap. The following is an overview of the map.

In [33]:
from IPython.display import HTML
In [34]:
HTML('<iframe width="425" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://www.openstreetmap.org/export/embed.html?bbox=-118.3918%2C34.0754%2C-118.3320%2C34.0996&amp;layer=mapnik"></iframe><br/>')
Out[34]:

Problems encountered in the data

After downloading the data and studying a small sample of this area, I noticed the following problems:

  1. Inconsistent postal codes
  2. Inconsistent phone numbers
  3. Abbreviated street names

I will deal with these problems in the above order.

1. Inconsistent postal codes

All postal codes are summarized in a dictionary, with each code as a key and its number of occurrences as the value.

{'90004': 1,
 '90027': 1,
 '90028': 4,
 '90036': 10,
 '90038': 5,
 '90046': 124,
 '90046-4101': 1,
 '90048': 9,
 '90069': 612,
 '90210': 47,
 'CA 90036': 1,
 'CA 90048': 1,
 'CA 90069': 2}

I clean the data in two steps.

Step 1. Standardize the format to a 5-digit string.

  • e.g. 'CA 90036' -> '90036' and '90046-4101' -> '90046'

Step 2. Append the city name after the 5-digit postcode.

  • e.g. '90069' -> '90069, West Hollywood'

In the second step, I match city names to postal codes by parsing the website http://www.zipcodestogo.com/California/. The result confirms that West Hollywood is indeed the area of our focus. Nodes from other areas are included because the selection on OpenStreetMap had to be drawn as a rectangle.
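The two steps can be sketched as follows. This is a simplified version of the cleaning logic; zip_to_city stands in for the dictionary parsed from the website above and is hard-coded here for illustration.

```python
import re

def clean_postcode(raw, zip_to_city):
    """Normalize a raw postcode to '<5 digits>, <city>' (steps 1 and 2)."""
    m = re.search(r'\d{5}', raw)   # first 5-digit group: drops 'CA ' prefixes and ZIP+4 suffixes
    if m is None:
        return raw                 # leave unrecognized values untouched
    code = m.group()
    return '%s, %s' % (code, zip_to_city.get(code, 'Unknown'))

zip_to_city = {'90036': 'Los Angeles', '90046': 'Los Angeles',
               '90069': 'West Hollywood'}
clean_postcode('CA 90036', zip_to_city)    # -> '90036, Los Angeles'
clean_postcode('90046-4101', zip_to_city)  # -> '90046, Los Angeles'
```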

{'90004': '90004, Los Angeles',
 '90027': '90027, Los Angeles',
 '90028': '90028, Los Angeles',
 '90036': '90036, Los Angeles',
 '90038': '90038, Los Angeles',
 '90046': '90046, Los Angeles',
 '90046-4101': '90046, Los Angeles',
 '90048': '90048, Los Angeles',
 '90069': '90069, West Hollywood',
 '90210': '90210, Beverly Hills',
 'CA 90036': '90036, Los Angeles',
 'CA 90048': '90048, Los Angeles',
 'CA 90069': '90069, West Hollywood'}

2. Inconsistent phone numbers

The desired format for a phone number is +1-213-224-4153, that is,

  1. the number starts with '+1';
  2. the number contains 11 digits (the country code plus a 10-digit number);
  3. hyphens separate the country code, the area code, and the two halves of the local number.

However, after querying the data, we find the following inconsistent formats:

{'(323) 464-2989': 1,
 '(323) 654-4411': 1,
 '(323) 654-6686': 1,
 '(323) 874-6700': 1,
 '(323) 934-1121': 1,
 '+1 (310) 980 8064': 1,
 '+1 (323) 857-1882': 1,
 '+1 (323) 928-3002': 1,
 '+1 (323) 944-0855': 1,
 '+1 310 3600916': 1,
 '+1 310-652-5091': 1,
 '+1 310-854-3488': 1,
 '+1 323 654 7125': 1,
 '+1 323 6548271': 1,
 '+1 323 747 1388': 1,
 '+1 323-654-4222': 1,
 '+1 323-654-8713': 1,
 '+1- 323-978-2170': 1,
 '+1-1-310-424-1600': 1,
 '+1-323-650-057': 1,
 '+1-800-PINKDOT': 1,
 '+13239931700': 1,
 '0013239366154': 1,
 '013239514800': 1,
 '1-323-871-8318': 1,
 '1-323-874-7924': 1,
 '310 289 2000': 1,
 '310-720-3809': 1,
 '323-379-2091': 1,
 '323-415-6860': 1,
 '323-424-7731': 1,
 '323-540-4551': 1,
 '323-697-5338': 1,
 '323-931-1466': 1,
 '323-937-2801': 1,
 '323-988-1119': 1,
 '855-888-5575': 1}

These bad formats may belong to one or several of the following categories:

  1. missing '+1' at the beginning, e.g. '323-697-5338'
  2. spaces instead of hyphens, e.g. '+1 323 654 7125'
  3. redundant spaces, e.g. '+1 323-654-8713', '+1- 323-978-2170'
  4. parentheses, e.g. '+1 (323) 944-0855', '(323) 654-4411'
  5. missing hyphens, e.g. '+13239931700'
  6. redundant zeros, e.g. '0013239366154', '013239514800'
  7. a missing digit, e.g. '+1-323-650-057'
  8. service hotline, e.g. '+1-800-PINKDOT' (this is acceptable)

I will concentrate on (1)-(6). Notice that (8) is already in an acceptable format, and (7) cannot be fixed without further information. After cleaning, we have:
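A minimal sketch of a cleaning function covering cases (1)-(6); the helper actually used in the project may differ in detail, but the idea is the same: reduce the value to bare digits, drop redundant country codes and zeros, and reassemble the canonical format.

```python
import re

def clean_phone(raw):
    """Normalize a US phone number to '+1-XXX-XXX-XXXX' (cases 1-6)."""
    if re.search(r'[A-Za-z]', raw):
        return raw                                  # keep letter hotlines like '+1-800-PINKDOT' (case 8)
    s = re.sub(r'^\+?1[\s-]+', '', raw.strip())     # drop an explicit leading '+1 ' / '1-' country code
    digits = re.sub(r'\D', '', s).lstrip('0')       # strip punctuation and redundant zeros (cases 2-6)
    while len(digits) > 10 and digits.startswith('1'):
        digits = digits[1:]                         # drop any remaining duplicated country code
    if len(digits) != 10:
        return raw                                  # e.g. a digit is missing (case 7)
    return '+1-%s-%s-%s' % (digits[:3], digits[3:6], digits[6:])

clean_phone('(323) 464-2989')   # -> '+1-323-464-2989'
clean_phone('0013239366154')    # -> '+1-323-936-6154'
clean_phone('+1-323-650-057')   # -> '+1-323-650-057' (left unchanged)
```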

{'(323) 464-2989': '+1-323-464-2989',
 '(323) 654-4411': '+1-323-654-4411',
 '(323) 654-6686': '+1-323-654-6686',
 '(323) 874-6700': '+1-323-874-6700',
 '(323) 934-1121': '+1-323-934-1121',
 '+1 (310) 980 8064': '+1-310-980-8064',
 '+1 (323) 857-1882': '+1-323-857-1882',
 '+1 (323) 928-3002': '+1-323-928-3002',
 '+1 (323) 944-0855': '+1-323-944-0855',
 '+1 310 3600916': '+1-310-360-0916',
 '+1 310-652-5091': '+1-310-652-5091',
 '+1 310-854-3488': '+1-310-854-3488',
 '+1 323 654 7125': '+1-323-654-7125',
 '+1 323 6548271': '+1-323-654-8271',
 '+1 323 747 1388': '+1-323-747-1388',
 '+1 323-654-4222': '+1-323-654-4222',
 '+1 323-654-8713': '+1-323-654-8713',
 '+1- 323-978-2170': '+1-323-978-2170',
 '+1-1-310-424-1600': '+1-310-424-1600',
 '+1-323-650-057': '+1-323-650-057',
 '+1-800-PINKDOT': '+1-800-PINKDOT',
 '+13239931700': '+1-323-993-1700',
 '0013239366154': '+1-323-936-6154',
 '013239514800': '+1-323-951-4800',
 '1-323-871-8318': '+1-323-871-8318',
 '1-323-874-7924': '+1-323-874-7924',
 '310 289 2000': '+1-310-289-2000',
 '310-720-3809': '+1-310-720-3809',
 '323-379-2091': '+1-323-379-2091',
 '323-415-6860': '+1-323-415-6860',
 '323-424-7731': '+1-323-424-7731',
 '323-540-4551': '+1-323-540-4551',
 '323-697-5338': '+1-323-697-5338',
 '323-931-1466': '+1-323-931-1466',
 '323-937-2801': '+1-323-937-2801',
 '323-988-1119': '+1-323-988-1119',
 '855-888-5575': '+1-855-888-5575'}

3. Abbreviated street names

We can observe that some of the street names are abbreviated or mistyped:

{'Ave': set(['N La Brea Ave']),
 'Blvd': set(['7290 Beverly Blvd',
              'North Robertson Blvd',
              'Santa Monica Blvd',
              'Sunset Blvd']),
 'Blvd.': set(['North Robertson Blvd.', 'Santa Monica Blvd.']),
 'avenue': set(['north la brea avenue']),
 'blvd': set(['sunset blvd'])}

The following corrections are implemented:


sunset blvd => sunset Boulevard
North Mansfield => North Mansfield
Sunset Blvd => Sunset Boulevard
Santa Monica Blvd => Santa Monica Boulevard
7290 Beverly Blvd => 7290 Beverly Boulevard
North Robertson Blvd => North Robertson Boulevard
north la brea avenue => north la brea Avenue => North la brea Avenue
N La Brea Ave => N La Brea Avenue => North La Brea Avenue
North Robertson Blvd. => North Robertson Boulevard
Santa Monica Blvd. => Santa Monica Boulevard
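The corrections above can be implemented with a simple word-level mapping. MAPPING below is a hypothetical table assembled during auditing; in practice it would grow as new variants are discovered.

```python
# Hypothetical mapping from abbreviated words to their full forms.
MAPPING = {'Ave': 'Avenue', 'avenue': 'Avenue',
           'Blvd': 'Boulevard', 'Blvd.': 'Boulevard', 'blvd': 'Boulevard',
           'N': 'North'}

def update_name(name, mapping=MAPPING):
    """Replace each abbreviated word in a street name with its full form."""
    return ' '.join(mapping.get(word, word) for word in name.split())

update_name('N La Brea Ave')        # -> 'North La Brea Avenue'
update_name('Santa Monica Blvd.')   # -> 'Santa Monica Boulevard'
```

Fixing lowercase names such as 'sunset blvd' additionally requires a capitalization pass, which is why some corrections above are shown in two steps.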

Data overview and Analysis

File Size

Size of the file ....... 58,913 KB

nodes.csv .............. 23,462 KB

nodes_tags.csv ......... 293 KB

ways.csv ............... 1,595 KB

ways_tags.csv .......... 6,761 KB

ways_nodes.csv ......... 4,979 KB

My analysis consists of two parts.

Part 1: map data analysis

Part 2: user data analysis

I use SQL queries to present basic statistics and discuss some ideas based on the observations.

Part 1. Map data analysis


To facilitate analysis, I will first combine the nodes_tags and ways_tags tables into a single table using SQL:

In [ ]:
cur.executescript('''
CREATE TABLE tags_nodes_ways (
    id INTEGER,
    key TEXT,
    value TEXT,
    type TEXT,
    FOREIGN KEY (id) REFERENCES nodes(id)
);

INSERT INTO tags_nodes_ways 
    SELECT * FROM nodes_tags
    UNION ALL
    SELECT * FROM ways_tags;

''')

Sort amenities by count

In [42]:
cur.execute('''
select A.value, count(*) as num
from tags_nodes_ways as A 
where A.key == 'amenity'
group by A.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
[(u'restaurant', 118),
 (u'parking', 97),
 (u'waste_basket', 47),
 (u'cafe', 31),
 (u'place_of_worship', 24),
 (u'fast_food', 22),
 (u'school', 21),
 (u'fuel', 15),
 (u'bank', 14),
 (u'bar', 14),
 (u'studio', 14),
 (u'atm', 10),
 (u'nightclub', 10),
 (u'theatre', 10),
 (u'post_box', 9),
 (u'pharmacy', 8),
 (u'hospital', 7),
 (u'post_office', 7),
 (u'ice_cream', 5),
 (u'vending_machine', 5),
 (u'whirlpool', 5),
 (u'car_rental', 4),
 (u'dentist', 4),
 (u'fountain', 4),
 (u'library', 4),
 (u'veterinary', 4),
 (u'arts_centre', 3),
 (u'bicycle_rental', 3),
 (u'cinema', 3),
 (u'fire_station', 3),
 (u'bench', 2),
 (u'doctors', 2),
 (u'police', 2),
 (u'shelter', 2),
 (u'brothel', 1),
 (u'car_wash', 1),
 (u'dojo', 1),
 (u'money_transfer', 1),
 (u'pub', 1),
 (u'public_building', 1),
 (u'telephone', 1),
 (u'townhall', 1)]

Sort religions by count

In [43]:
cur.execute('''
select tags_nodes_ways.value, count(*) as num
from tags_nodes_ways
join (select distinct(id) 
      from tags_nodes_ways
      where value == 'place_of_worship') A
  on A.id == tags_nodes_ways.id
where tags_nodes_ways.key == 'religion'
group by tags_nodes_ways.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
[(u'christian', 14), (u'jewish', 8), (u'buddhist', 1)]

Sort Restaurants by count.

Observations:

  1. Italian and Japanese food are especially popular.
  2. Fast food such as burgers, pizza, sushi, and sandwiches is popular.
In [45]:
cur.execute('''
select tags_nodes_ways.value, count(*) as num
from tags_nodes_ways
join (select distinct(id) 
      from tags_nodes_ways
      where value == 'restaurant' or value == 'fast_food') A
  on A.id == tags_nodes_ways.id
where tags_nodes_ways.key == 'cuisine'
group by tags_nodes_ways.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
[(u'burger', 10),
 (u'italian', 8),
 (u'japanese', 8),
 (u'mexican', 8),
 (u'pizza', 7),
 (u'american', 5),
 (u'sandwich', 4),
 (u'sushi', 4),
 (u'chinese', 3),
 (u'steak_house', 3),
 (u'thai', 3),
 (u'indian', 2),
 (u'pancake', 2),
 (u'seafood', 2),
 (u'Upscale_Comfort_Food', 1),
 (u'american;fine_dining', 1),
 (u'asian', 1),
 (u'barbecue', 1),
 (u'brazilian', 1),
 (u'burger;hotdog', 1),
 (u'burger;mexican', 1),
 (u'chicken', 1),
 (u'corsican', 1),
 (u'greek', 1),
 (u'grill', 1),
 (u'hawaiian', 1),
 (u'hotdog;burger', 1),
 (u'kebab', 1),
 (u'mediterranean', 1),
 (u'mexican;vegetarian', 1),
 (u'peruvian', 1),
 (u'pizza,_burger,_american', 1),
 (u'ramen;japanese', 1),
 (u'russian', 1),
 (u'salad', 1),
 (u'spanish;sandwich', 1),
 (u'tacos', 1),
 (u'vietnamese', 1)]

Sort banks by count: Chase and Wells Fargo each have 4 branches.

In [46]:
cur.execute('''
select tags_nodes_ways.id, tags_nodes_ways.key, tags_nodes_ways.value, 
       count(*) as num
from tags_nodes_ways
join (select distinct(id) 
      from tags_nodes_ways
      where value == 'bank') A
  on A.id == tags_nodes_ways.id
where tags_nodes_ways.key == 'name'
group by tags_nodes_ways.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
[(440269302, u'name', u'Chase', 4),
 (5247316878L, u'name', u'Wells Fargo', 4),
 (391157722, u'name', u'Bank of America', 2),
 (425405625, u'name', u'Citibank', 1),
 (391157716, u'name', u'U.S. Bank', 1),
 (429386287, u'name', u'Union Bank', 1),
 (5206110790L, u'name', u'Western Union', 1)]

Sort postal codes by count

In [47]:
cur.execute('''
select tags_nodes_ways.value, count(*) as num
from tags_nodes_ways
where tags_nodes_ways.key == 'postcode'
group by tags_nodes_ways.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
[(u'90069, West Hollywood', 601),
 (u'90046, Los Angeles', 124),
 (u'90210, Beverly Hills', 46),
 (u'90036, Los Angeles', 11),
 (u'90048, Los Angeles', 10),
 (u'90038, Los Angeles', 5),
 (u'90028, Los Angeles', 4),
 (u'90004, Los Angeles', 1),
 (u'90027, Los Angeles', 1)]

Sort leisure by count:

  • People love sports. In particular, swimming pools are by far the most common leisure feature, followed by sports pitches.
In [48]:
cur.execute('''
select A.value, count(*) as num
from tags_nodes_ways as A 
where A.key == 'leisure'
group by A.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
[(u'swimming_pool', 701),
 (u'pitch', 49),
 (u'park', 15),
 (u'fitness_centre', 9),
 (u'sports_centre', 6),
 (u'playground', 4),
 (u'tanning_salon', 1),
 (u'track', 1)]

Sort shop by count:

  • clothes, beauty, hairdresser, jewelry and shoes rank high.
    • I would conclude that fashion consumption plays an essential role in this area.
In [49]:
cur.execute('''
select A.value, count(*) as num
from tags_nodes_ways as A 
where A.key == 'shop'
group by A.value
having num >= 3
order by num desc
;
''')
pprint.pprint(cur.fetchall())
[(u'clothes', 73),
 (u'beauty', 43),
 (u'hairdresser', 24),
 (u'jewelry', 9),
 (u'shoes', 9),
 (u'supermarket', 9),
 (u'convenience', 7),
 (u'department_store', 7),
 (u'tattoo', 7),
 (u'alcohol', 6),
 (u'laundry', 6),
 (u'mall', 6),
 (u'massage', 6),
 (u'tobacco', 6),
 (u'gift', 5),
 (u'nutrition_supplements', 5),
 (u'optician', 5),
 (u'pet', 5),
 (u'antiques', 4),
 (u'books', 4),
 (u'boutique', 4),
 (u'electronics', 4),
 (u'furniture', 4),
 (u'yes', 4),
 (u'art', 3),
 (u'car', 3),
 (u'car_repair', 3),
 (u'sports', 3),
 (u'watches', 3)]

Sort tourism by count

It seems that many people come here for vacation:

  • the number of attractions is low
  • the number of hotels is high
In [50]:
cur.execute('''
select A.value, count(*) as num
from tags_nodes_ways as A 
where A.key == 'tourism'
group by A.value
having num >= 3
order by num desc
;
''')
pprint.pprint(cur.fetchall())
[(u'hotel', 42),
 (u'motel', 9),
 (u'attraction', 6),
 (u'guest_house', 4),
 (u'artwork', 3),
 (u'yes', 3)]
Additional ideas

From the results, I notice that the same entity often appears under different names. Here are several examples.

Example 1: source of data.

  • As we can see, Bing appears in the following forms: 'TIGER, Bing', 'Bing', 'Bing; TIGER 2012', 'bing', 'Tiger2011;Bing'.
In [51]:
cur.execute('''
select tags_nodes_ways.value, count(*) as num
from tags_nodes_ways
where tags_nodes_ways.key == 'source'
group by tags_nodes_ways.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
[(u'mapillary', 575),
 (u'TIGER, Bing', 296),
 (u'survey', 36),
 (u'Bing', 30),
 (u'USGS Geonames', 16),
 (u'Bing; TIGER 2012', 5),
 (u'GPS; survey; Yahoo', 3),
 (u'TIGER/Line\xae 2008 Place Shapefiles (http://www.census.gov/geo/www/tiger/)',
  3),
 (u'bing', 2),
 (u'satellite', 2),
 (u'Jason Sodenkamp - Proprietor', 1),
 (u'Los Angeles Fire Department', 1),
 (u'Tiger2011;Bing', 1),
 (u'Yahoo', 1),
 (u'solar', 1)]

To count every entry whose source mentions Bing, regardless of spelling, the following query works:

In [52]:
cur.execute('''
select tags_nodes_ways.value, count(*) as num
from tags_nodes_ways
where tags_nodes_ways.key == 'source' and tags_nodes_ways.value like '%bing%'
order by num desc
;
''')
pprint.pprint(cur.fetchall())
[(u'Bing', 334)]

Example 2: name of coffee shops.

  • Starbucks appears as both 'Starbucks Coffee' and 'Starbucks'.
In [53]:
cur.execute('''
select * 
from (select tags_nodes_ways.id, tags_nodes_ways.key, tags_nodes_ways.value, 
       count(*) as num
      from tags_nodes_ways
      join (select distinct(id) 
            from tags_nodes_ways
            where value == 'cafe') A
      on A.id == tags_nodes_ways.id
      where tags_nodes_ways.key == 'name'
      group by tags_nodes_ways.value
      order by num desc) B
where B.value like '%Starbucks%'
;
''')
pprint.pprint(cur.fetchall())
[(429313746, u'name', u'Starbucks Coffee', 6),
 (3833300156L, u'name', u'Starbucks', 1)]

Example 3: name of shops.

  • We can see that 7-Eleven appears as '7-Eleven', '7 Eleven', and '7-eleven'.
In [54]:
cur.execute('''
select * 
from (
      select tags_nodes_ways.id, tags_nodes_ways.key, tags_nodes_ways.value, 
             count(*) as num
      from tags_nodes_ways
      join (select distinct(id) 
            from tags_nodes_ways
            where key == 'shop') A
        on A.id == tags_nodes_ways.id
      where tags_nodes_ways.key == 'name'
      group by tags_nodes_ways.value
      order by num desc
      ) B
WHERE instr(B.value, '7') > 0
;
''')
pprint.pprint(cur.fetchall())
[(3384647299L, u'name', u'7-Eleven', 3),
 (3825249428L, u'name', u'7 Eleven', 1),
 (4901842621L, u'name', u'7-eleven', 1)]

Such naming inconsistencies lead to inaccurate statistics. It would be helpful to provide users with a standard naming convention for the most commonly used names.
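One way to detect such variants automatically is to compare names under a crude canonical form. The canonical() helper below is a hypothetical illustration, not part of the project code:

```python
import re

def canonical(name):
    """Collapse spelling variants by lowercasing and dropping punctuation/whitespace."""
    return re.sub(r'[^a-z0-9]', '', name.lower())

# All three spellings of 7-Eleven collapse to one key:
variants = {canonical(n) for n in ['7-Eleven', '7 Eleven', '7-eleven']}  # -> {'7eleven'}
```

Grouping tag values by canonical(value) before counting would merge such variants; the trade-off is that genuinely different names may occasionally collide.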

Part 2. User activity analysis


Number of Nodes

In [13]:
cur.execute('''
select count(*)
from nodes;
''')
pprint.pprint(cur.fetchall())
[(256998,)]

Number of ways

In [14]:
cur.execute('''
select count(*)
from ways;
''')
pprint.pprint(cur.fetchall())
[(23891,)]

Number of Unique users

In [15]:
cur.execute('''
select count(distinct(A.user))
from (select user from nodes 
      union all 
      select user from ways) A
;
''')
pprint.pprint(cur.fetchall())
[(251,)]

Top ten contributing users

In [16]:
cur.execute('''
select A.user, count(*) as num
from (select user from nodes 
      union all 
      select user from ways) A
group by A.user
order by num desc
limit 10
;
''')
pprint.pprint(cur.fetchall())
[(u'schleuss_imports', 130738),
 (u'dannykath_labuildings', 35912),
 (u'manings_labuildings', 23468),
 (u'RichRico_labuildings', 12544),
 (u'piligab_labuildings', 11952),
 (u'Luis36995_labuildings', 10488),
 (u'karitotp', 9836),
 (u'yurasi_import', 5556),
 (u'bdiscoe_imports', 4349),
 (u'markbegbie', 4007)]

Number of users appearing less than 3 times

In [17]:
cur.execute('''
select count(*)
from (select A.user, count(*) as num
      from (select user from nodes
            union all
            select user from ways) A
      group by A.user
      having num <= 3) B
;
''')
pprint.pprint(cur.fetchall())
[(96,)]

We observe that user contributions are highly skewed: 88.59% of the entries were contributed by the top ten users below.

username                 contributions  percentage
schleuss_imports                130738      46.54%
dannykath_labuildings            35912      12.79%
manings_labuildings              23468       8.35%
RichRico_labuildings             12544       4.47%
piligab_labuildings              11952       4.26%
Luis36995_labuildings            10488       3.73%
karitotp                          9836       3.50%
yurasi_import                     5556       1.98%
bdiscoe_imports                   4349       1.55%
markbegbie                        4007       1.43%
TOTAL                           248850      88.59%
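The percentages above can be reproduced with a single query. A minimal sketch against a toy in-memory database (the report runs the same query on the full nodes and ways tables):

```python
import sqlite3

# Toy in-memory database following the report's schema (nodes/ways with a user column).
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.executescript('''
CREATE TABLE nodes (id INTEGER, user TEXT);
CREATE TABLE ways  (id INTEGER, user TEXT);
INSERT INTO nodes VALUES (1,'alice'),(2,'alice'),(3,'bob');
INSERT INTO ways  VALUES (4,'alice');
''')

# Contribution count and share per user over the union of nodes and ways.
cur.execute('''
SELECT A.user, COUNT(*) AS num,
       ROUND(100.0 * COUNT(*) /
             (SELECT (SELECT COUNT(*) FROM nodes) + (SELECT COUNT(*) FROM ways)),
             2) AS pct
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) A
GROUP BY A.user
ORDER BY num DESC;
''')
rows = cur.fetchall()  # -> [('alice', 3, 75.0), ('bob', 1, 25.0)]
```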

Most of the user names end with _imports or _labuildings (see the Los Angeles County building import). This suggests that the data were imported rather than entered manually. We now verify this point.

If the data were imported, we would expect a large inflow of data within a short period of time. We now visualize the data by date.

In [24]:
import cufflinks as cf  # interactive plotly graphs

cf.go_offline()
df_nodes_ways_ct.ymd.dt.date.iplot(kind='histogram',
                                   filename='cufflinks/basic-histogram')

We notice unusually high data contributions in June and July 2016. I now look at the contributions by day for each of these months.

In [26]:
fig = plt.figure(figsize = (10,5))

plt.suptitle('Count of contributions by day', fontsize=16)

ax1 = fig.add_subplot(1,2,1)

ax2 = fig.add_subplot(1,2,2, sharey = ax1)
ax21 = ax2.twinx()

sns.set()
sns.set_style('dark')

sns.distplot(df_nodes_ways_ct[(df_nodes_ways_ct['year'] == 2016) & \
                              (df_nodes_ways_ct['month']==6)]['day'],
            ax = ax1, 
            hist_kws={"histtype": "stepfilled", "linewidth": 0.5,
                                    "alpha": 0.4, "color": "g"})
ax1.axhline( y = 0.36, color = 'g', linestyle = '--')
ax1.set(ylabel='relative frequency', title = 'month = 6')

sns.distplot(df_nodes_ways_ct[(df_nodes_ways_ct['year'] == 2016) & \
                              (df_nodes_ways_ct['month']==7)]['day'],
            ax = ax2, 
            hist_kws={"histtype": "stepfilled", "linewidth": 0.5,
                                    "alpha": 0.4, "color": "g"})
 
sns.distplot(df_nodes_ways_ct[(df_nodes_ways_ct['year'] == 2016) & \
                              (df_nodes_ways_ct['month'].isin([7]))]['day'],
            ax = ax21, kde = False)

ax21.axhline( y = 15000 , color = 'g', linestyle = '--')
ax21.set(ylabel='absolute frequency', title = 'month = 7')
Out[26]:
[<matplotlib.text.Text at 0x1f993ac8>, <matplotlib.text.Text at 0x1f9a0e80>]

There are three days on which user contributions exceed 15,000 (above the green dotted line).

I wonder how many contributors there were on each of these three days.

In [31]:
fig = plt.figure(figsize = (10,5))

plt.suptitle('Count of contributions by users, > 15000 contributions', fontsize=16)

ax1 = fig.add_subplot(1,2,1)

ax2 = fig.add_subplot(1,2,2,sharey = ax1)

sns.set()
sns.set_context("notebook")

sns.countplot(x = df_nodes_ways_ct[(df_nodes_ways_ct['year'] == 2016) & \
                              (df_nodes_ways_ct['month']==6) & \
                             (df_nodes_ways_ct['day'].isin([7,18]))]['user'],
             ax = ax1,\
              facecolor = (0,0,0,0),\
             linewidth=5,\
              edgecolor=sns.color_palette("BrBG", 7))

ax1.set(ylabel='count of contributions', title = '7th, 18th of June')
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=50)

sns.countplot(x = df_nodes_ways_ct[(df_nodes_ways_ct['year'] == 2016) & \
                              (df_nodes_ways_ct['month']==7) & \
                             (df_nodes_ways_ct['day']==5)]['user'],
             ax = ax2,\
              facecolor = (0,0,0,0),\
             linewidth=5,\
              edgecolor=sns.color_palette("dark", 3))

ax2.set(ylabel=' ', title = '5th, July')
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=50)
Out[31]:
[<matplotlib.text.Text at 0x1f6ffe80>,
 <matplotlib.text.Text at 0x1f70bcc0>,
 <matplotlib.text.Text at 0x1f6c12e8>,
 <matplotlib.text.Text at 0x1f6b8da0>,
 <matplotlib.text.Text at 0x1f6411d0>]

From the left-hand graph, we notice that schleuss_imports is the only contributor on the 7th and 18th of June. It seems that the user Schleuss imports data under the name schleuss_imports: around 63,090 entries were imported under schleuss_imports, while only 170 entries were made under Schleuss. I think this is a clear sign of automated versus manual work.

Finally, I am interested in seeing on which days of the week, and at which hours of the day, users are most active.

In [32]:
fig = plt.figure(figsize = (10,5))

# plt.suptitle('Count of contributions by users', fontsize=16)

ax1 = fig.add_subplot(1,2,1)

ax2 = fig.add_subplot(1,2,2)

sns.set()
sns.set_context("notebook")

sns.countplot(y = df_nodes_ways_ct['weekday'], ax = ax1)

ax1.set(ylabel='weekday', title = 'User activity by weekday')
ax1.set_xticklabels(ax1.get_xticks().astype(int), rotation=40)

sns.countplot(y = df_nodes_ways_ct['hour'],
             ax = ax2)

ax2.set(title = 'User activity by hour')
ax2.set_xticklabels(ax2.get_xticks().astype(int), rotation=40)
Out[32]:
[<matplotlib.text.Text at 0x1f3da630>,
 <matplotlib.text.Text at 0x1f3e6f60>,
 <matplotlib.text.Text at 0x1ebdbcc0>,
 <matplotlib.text.Text at 0x1ebde160>,
 <matplotlib.text.Text at 0x1ebd1828>,
 <matplotlib.text.Text at 0x1ebc63c8>,
 <matplotlib.text.Text at 0x1ebde7f0>,
 <matplotlib.text.Text at 0x1ebdee80>]
Left-hand Graph:

0: Monday. 1: Tuesday. 2: Wednesday. 3: Thursday. 4: Friday. 5: Saturday. 6: Sunday.

Most of the data were entered on Tuesday, followed by Saturday and Thursday. However, this result should be taken with caution given that most of the data were imported.

Right-hand Graph:

OpenStreetMap (OSM) is a collaborative project maintained by volunteers. Hence, it makes sense that most contributions are not made during working hours, i.e., from 8:00 to 18:00.

Final thoughts

Suggestion 1: Provide a data format standard for users to follow.

During data auditing and the descriptive map data analysis, we noticed that data consistency could be improved if there were a standard for users to follow.

  • Expected improvements:
    • A detailed standard helps ensure data accuracy.
  • Expected problems:
    • If the standard is unclear or hard to follow, it will be time-consuming for users to apply the format. Furthermore, strict rules can greatly dampen users' initiative.
    • Completing map data is a voluntary and tedious job, and users' enthusiasm may not last. The question therefore becomes to what extent, and how, we can motivate users.
Suggestion 2: Encourage teamwork to increase user interaction.

I think we could keep users involved by increasing collaboration among them. I suggest that OpenStreetMap ask users to complete their profiles with some personal information. Based on this information and users' preferences, OpenStreetMap could match people with similar traits and recommend that they work in groups.

  • Expected improvements:
    • I think this motivates users to commit to the job for a longer time.
    • Data input can be cross-checked by several users in a group, which should increase data quality.
  • Expected problems:
    • It may not be easy to implement this user-interaction feature in a short time.
    • It could be somewhat awkward if there turns out to be no chemistry in a team, which can again negatively affect users' initiative.
      • Gamification could be an alternative way to get users involved.